{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.092204,
     "end_time": "2020-06-23T18:39:21.723375",
     "exception": false,
     "start_time": "2020-06-23T18:39:21.631171",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "# Cleaning NOAA Weather Data of JFK Airport (New York)\n",
    "\n",
    "This notebook relates to the NOAA Weather Dataset - JFK Airport (New York). The dataset contains 114,546 hourly observations of 12 local climatological variables (such as temperature and wind speed) collected at JFK airport. This dataset can be obtained for free from the IBM Developer [Data Asset Exchange](https://developer.ibm.com/exchanges/data/all/jfk-weather-data/).\n",
    "\n",
    "In this notebook, we clean the raw dataset by:\n",
    "* removing redundant columns and preserving only key numeric columns\n",
    "* converting and cleaning data where required\n",
    "* creating a fixed time interval between observations (this aids with later time-series analysis)\n",
    "* filling missing values\n",
    "* encoding certain weather features\n",
    "\n",
    "### Table of Contents:\n",
    "* [1. Read the Raw Data](#cell1)\n",
    "* [2. Clean the Data](#cell2)\n",
    "  * [2.1 Select data columns](#cell3)\n",
    "  * [2.2 Clean up precipitation column](#cell4)\n",
    "  * [2.3 Convert columns to numerical types](#cell5)\n",
    "  * [2.4 Reformat and process data](#cell6)\n",
    "  * [2.5 Create a fixed interval dataset](#cell7)\n",
    "  * [2.6 Feature encoding](#cell8)\n",
    "  * [2.7 Rename columns](#cell9)\n",
    "* [3. Save the Cleaned Data](#cell10)\n",
    "* [Authors](#authors)\n",
    "\n",
    "#### Import required modules\n",
    "\n",
    "Import and configure the required modules."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T18:39:21.821927Z",
     "iopub.status.busy": "2020-06-23T18:39:21.816084Z",
     "iopub.status.idle": "2020-06-23T18:39:30.075645Z",
     "shell.execute_reply": "2020-06-23T18:39:30.074424Z"
    },
    "papermill": {
     "duration": 8.307268,
     "end_time": "2020-06-23T18:39:30.075908",
     "exception": false,
     "start_time": "2020-06-23T18:39:21.768640",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "!pip install PyGithub pandas > /dev/null 2>&1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T18:39:30.182273Z",
     "iopub.status.busy": "2020-06-23T18:39:30.180545Z",
     "iopub.status.idle": "2020-06-23T18:39:31.766005Z",
     "shell.execute_reply": "2020-06-23T18:39:31.766916Z"
    },
    "papermill": {
     "duration": 1.643517,
     "end_time": "2020-06-23T18:39:31.767117",
     "exception": false,
     "start_time": "2020-06-23T18:39:30.123600",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Define required imports\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import sys\n",
    "import re\n",
    "# These set pandas max column and row display in the notebook\n",
    "pd.set_option('display.max_columns', 50)\n",
    "pd.set_option('display.max_rows', 50)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.041501,
     "end_time": "2020-06-23T18:39:31.845060",
     "exception": false,
     "start_time": "2020-06-23T18:39:31.803559",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "<a id=\"cell1\"></a>\n",
    "\n",
    "### 1. Read the Raw Data\n",
    "\n",
    "We start by reading in the raw dataset, displaying the first few rows of the dataframe, and taking a look at the columns and column types present."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T18:39:31.955151Z",
     "iopub.status.busy": "2020-06-23T18:39:31.953784Z",
     "iopub.status.idle": "2020-06-23T18:39:44.415654Z",
     "shell.execute_reply": "2020-06-23T18:39:44.414375Z"
    },
    "papermill": {
     "duration": 12.521019,
     "end_time": "2020-06-23T18:39:44.415882",
     "exception": false,
     "start_time": "2020-06-23T18:39:31.894863",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "raw_data = pd.read_csv('data/noaa-weather-data-jfk-airport/jfk_weather.csv', parse_dates=['DATE'])\n",
    "raw_data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T18:39:44.592471Z",
     "iopub.status.busy": "2020-06-23T18:39:44.591261Z",
     "iopub.status.idle": "2020-06-23T18:39:44.605889Z",
     "shell.execute_reply": "2020-06-23T18:39:44.606756Z"
    },
    "papermill": {
     "duration": 0.111785,
     "end_time": "2020-06-23T18:39:44.607034",
     "exception": false,
     "start_time": "2020-06-23T18:39:44.495249",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "raw_data.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.05262,
     "end_time": "2020-06-23T18:39:44.711391",
     "exception": false,
     "start_time": "2020-06-23T18:39:44.658771",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "<a id=\"cell2\"></a>\n",
    "\n",
    "### 2. Clean the Data\n",
    "\n",
    "As you can see above, there are a lot of fields which are non-numerical - usually these will be fields that contain text or categorical data, e.g. `HOURLYSKYCONDITIONS`.\n",
    "\n",
    "There are also fields - such as the main temperature field of interest `HOURLYDRYBULBTEMPF` - that we expect to be numerical, but are instead `object` type. This often indicates that there may be missing (or `null`) values, or some other unusual readings that we may have to deal with (since otherwise the field would have been fully parsed as a numerical data type).\n",
    "\n",
    "In addition, some fields relate to hourly observations, while others relate to daily or monthly intervals. For purposes of later exploratory data analysis, we will restrict the dataset to a certain subset  of numerical fields that relate to hourly observations.\n",
    "\n",
    "In this section, we refer to the [NOAA Local Climatological Data Documentation](https://data.noaa.gov/dataset/dataset/u-s-local-climatological-data-lcd/resource/ee7381ea-647a-434f-8cfa-81202b9b4c05) to describe the fields and meaning of various values.\n",
    "\n",
    "<a id=\"cell3\"></a>\n",
    "#### 2.1 Select data columns\n",
    "\n",
    "First, we select only the subset of data columns of interest and inspect the column types."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T18:39:44.835163Z",
     "iopub.status.busy": "2020-06-23T18:39:44.833815Z",
     "iopub.status.idle": "2020-06-23T18:39:45.071289Z",
     "shell.execute_reply": "2020-06-23T18:39:45.069263Z"
    },
    "papermill": {
     "duration": 0.310215,
     "end_time": "2020-06-23T18:39:45.071537",
     "exception": false,
     "start_time": "2020-06-23T18:39:44.761322",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Choose what columns to import from raw data\n",
    "column_subset = [\n",
    "    'DATE',\n",
    "    'HOURLYVISIBILITY',\n",
    "    'HOURLYDRYBULBTEMPF',\n",
    "    'HOURLYWETBULBTEMPF',\n",
    "    'HOURLYDewPointTempF',\n",
    "    'HOURLYRelativeHumidity',\n",
    "    'HOURLYWindSpeed',\n",
    "    'HOURLYWindDirection',\n",
    "    'HOURLYStationPressure',\n",
    "    'HOURLYPressureTendency',\n",
    "    'HOURLYSeaLevelPressure',\n",
    "    'HOURLYPrecip',\n",
    "    'HOURLYAltimeterSetting'\n",
    "]\n",
    "\n",
    "# Filter dataset to relevant columns\n",
    "hourly_data = raw_data[column_subset]\n",
    "# Set date index\n",
    "hourly_data = hourly_data.set_index(pd.DatetimeIndex(hourly_data['DATE']))\n",
    "hourly_data.drop(['DATE'], axis=1, inplace=True)\n",
    "hourly_data.replace(to_replace='*', value=np.nan, inplace=True)\n",
    "hourly_data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T18:39:45.188028Z",
     "iopub.status.busy": "2020-06-23T18:39:45.186640Z",
     "iopub.status.idle": "2020-06-23T18:39:45.191713Z",
     "shell.execute_reply": "2020-06-23T18:39:45.192864Z"
    },
    "papermill": {
     "duration": 0.075418,
     "end_time": "2020-06-23T18:39:45.193224",
     "exception": false,
     "start_time": "2020-06-23T18:39:45.117806",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "hourly_data.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.038803,
     "end_time": "2020-06-23T18:39:45.296888",
     "exception": false,
     "start_time": "2020-06-23T18:39:45.258085",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "<a id=\"cell4\"></a>\n",
    "#### 2.2 Clean up precipitation column\n",
    "\n",
    "From the dataframe preview above, we can see that the column `HOURLYPrecip` - which is the hourly measure of precipitation levels - contains both `NaN` and `T` values. `T` specifies *trace amounts of precipitation*, while `NaN` means *not a number*, and is used to denote missing values.\n",
    "\n",
    "We can also inspect the unique values present for the field."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T18:39:45.398987Z",
     "iopub.status.busy": "2020-06-23T18:39:45.397966Z",
     "iopub.status.idle": "2020-06-23T18:39:45.411344Z",
     "shell.execute_reply": "2020-06-23T18:39:45.412295Z"
    },
    "papermill": {
     "duration": 0.075263,
     "end_time": "2020-06-23T18:39:45.412578",
     "exception": false,
     "start_time": "2020-06-23T18:39:45.337315",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "hourly_data['HOURLYPrecip'].unique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.050967,
     "end_time": "2020-06-23T18:39:45.515397",
     "exception": false,
     "start_time": "2020-06-23T18:39:45.464430",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "We can see that some values end with an `s` (indicating snow), while there is a strange value `0.020.01s` which appears to be an error of some sort. To deal with `T` values, we will set the observation to be `0`. We will also replace the erroneous value `0.020.01s` with `NaN`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T18:39:45.637889Z",
     "iopub.status.busy": "2020-06-23T18:39:45.637111Z",
     "iopub.status.idle": "2020-06-23T18:39:45.641781Z",
     "shell.execute_reply": "2020-06-23T18:39:45.642603Z"
    },
    "papermill": {
     "duration": 0.068251,
     "end_time": "2020-06-23T18:39:45.642901",
     "exception": false,
     "start_time": "2020-06-23T18:39:45.574650",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Fix imported data\n",
    "hourly_data['HOURLYPrecip'].replace(to_replace='T', value='0.00', inplace=True)\n",
    "hourly_data['HOURLYPrecip'].replace('0.020.01s', np.nan, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.063395,
     "end_time": "2020-06-23T18:39:45.772975",
     "exception": false,
     "start_time": "2020-06-23T18:39:45.709580",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "<a id=\"cell5\"></a>\n",
    "#### 2.3 Convert columns to numerical types\n",
    "\n",
    "Next, we will convert string columns that refer to numerical values to numerical types. For columns such as `HOURLYPrecip`, we first also drop the non-numerical parts of the value (the `s` character)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T18:39:46.014219Z",
     "iopub.status.busy": "2020-06-23T18:39:45.996284Z",
     "iopub.status.idle": "2020-06-23T18:39:48.521724Z",
     "shell.execute_reply": "2020-06-23T18:39:48.523331Z"
    },
    "papermill": {
     "duration": 2.690677,
     "end_time": "2020-06-23T18:39:48.523767",
     "exception": false,
     "start_time": "2020-06-23T18:39:45.833090",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Set of columns to convert\n",
    "messy_columns = column_subset[1:]\n",
    "\n",
    "# Convert columns to float32 datatype\n",
    "for i in messy_columns:\n",
    "    hourly_data[i] = hourly_data[i].apply(lambda x: re.sub('[^0-9,.-]', '', x) if type(x) == str else x).replace('', np.nan).astype(('float32'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.061545,
     "end_time": "2020-06-23T18:39:48.670019",
     "exception": false,
     "start_time": "2020-06-23T18:39:48.608474",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "We can now see that all fields have numerical data type."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T18:39:48.801717Z",
     "iopub.status.busy": "2020-06-23T18:39:48.800331Z",
     "iopub.status.idle": "2020-06-23T18:39:48.876561Z",
     "shell.execute_reply": "2020-06-23T18:39:48.875459Z"
    },
    "papermill": {
     "duration": 0.15388,
     "end_time": "2020-06-23T18:39:48.876939",
     "exception": false,
     "start_time": "2020-06-23T18:39:48.723059",
     "status": "completed"
    },
    "scrolled": true,
    "tags": []
   },
   "outputs": [],
   "source": [
    "print(hourly_data.info())\n",
    "print()\n",
    "hourly_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.060749,
     "end_time": "2020-06-23T18:39:49.004064",
     "exception": false,
     "start_time": "2020-06-23T18:39:48.943315",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "<a id=\"cell6\"></a>\n",
    "#### 2.4 Reformat and process data\n",
    "\n",
    "Next, we will clean up some of the data columns to ensure their values fall within the parameters defined by the NOAA documentation (referred to above). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T18:39:49.146937Z",
     "iopub.status.busy": "2020-06-23T18:39:49.145668Z",
     "iopub.status.idle": "2020-06-23T18:39:49.414728Z",
     "shell.execute_reply": "2020-06-23T18:39:49.413260Z"
    },
    "papermill": {
     "duration": 0.348362,
     "end_time": "2020-06-23T18:39:49.415029",
     "exception": false,
     "start_time": "2020-06-23T18:39:49.066667",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Generate the summary statistics for each column\n",
    "hourly_data.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.071245,
     "end_time": "2020-06-23T18:39:49.539044",
     "exception": false,
     "start_time": "2020-06-23T18:39:49.467799",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "According to the documentation, the `HOURLYPressureTendency` field should be an integer value in the range `[0, 8]`. Let's check if this condition holds for this dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T18:39:49.698255Z",
     "iopub.status.busy": "2020-06-23T18:39:49.697032Z",
     "iopub.status.idle": "2020-06-23T18:39:49.718265Z",
     "shell.execute_reply": "2020-06-23T18:39:49.719101Z"
    },
    "papermill": {
     "duration": 0.112969,
     "end_time": "2020-06-23T18:39:49.719340",
     "exception": false,
     "start_time": "2020-06-23T18:39:49.606371",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Check if categorical variable HOURLYPressureTendency ever has a non-integer entry outside the bounds of 0-8\n",
    "cond = len(hourly_data[~hourly_data['HOURLYPressureTendency'].isin(list(range(0,9)) + [np.nan])])\n",
    "print('Hourly Pressure Tendency should be between 0 and 8: {}'.format(cond == 0))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.064188,
     "end_time": "2020-06-23T18:39:49.838842",
     "exception": false,
     "start_time": "2020-06-23T18:39:49.774654",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "The `HOURLYVISIBILITY` should be an integer in the range `[0, 10]`. Let's check this condition too."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T18:39:50.029099Z",
     "iopub.status.busy": "2020-06-23T18:39:50.026223Z",
     "iopub.status.idle": "2020-06-23T18:39:50.112868Z",
     "shell.execute_reply": "2020-06-23T18:39:50.114327Z"
    },
    "papermill": {
     "duration": 0.188961,
     "end_time": "2020-06-23T18:39:50.114669",
     "exception": false,
     "start_time": "2020-06-23T18:39:49.925708",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Hourly Visibility should be between 0 and 10\n",
    "hourly_data[(hourly_data['HOURLYVISIBILITY'] < 0) | (hourly_data['HOURLYVISIBILITY'] > 10)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.074652,
     "end_time": "2020-06-23T18:39:50.305437",
     "exception": false,
     "start_time": "2020-06-23T18:39:50.230785",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "We find that a couple of observations fall outside the range. These must be spurious data observations and we handle them by replacing them with `NaN`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T18:39:50.472871Z",
     "iopub.status.busy": "2020-06-23T18:39:50.470644Z",
     "iopub.status.idle": "2020-06-23T18:39:50.494632Z",
     "shell.execute_reply": "2020-06-23T18:39:50.493518Z"
    },
    "papermill": {
     "duration": 0.106781,
     "end_time": "2020-06-23T18:39:50.494873",
     "exception": false,
     "start_time": "2020-06-23T18:39:50.388092",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Replace any hourly visibility figure outside these bounds with nan\n",
    "hourly_data.loc[hourly_data['HOURLYVISIBILITY'] > 10, 'HOURLYVISIBILITY'] = np.nan\n",
    "\n",
    "# Hourly Visibility should be between 0 and 10\n",
    "cond = len(hourly_data[(hourly_data['HOURLYVISIBILITY'] < 0) | (hourly_data['HOURLYVISIBILITY'] > 10)])\n",
    "print('Hourly Visibility should be between 0 and 10: {}'.format(cond == 0))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.059017,
     "end_time": "2020-06-23T18:39:50.621964",
     "exception": false,
     "start_time": "2020-06-23T18:39:50.562947",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "Finally, we check if there are any duplicates with respect to our `DATE` index and check furthermore that our dates are in the correct order (that is, strictly increasing)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T18:39:50.758805Z",
     "iopub.status.busy": "2020-06-23T18:39:50.755334Z",
     "iopub.status.idle": "2020-06-23T18:39:50.804285Z",
     "shell.execute_reply": "2020-06-23T18:39:50.789921Z"
    },
    "papermill": {
     "duration": 0.135424,
     "end_time": "2020-06-23T18:39:50.813078",
     "exception": false,
     "start_time": "2020-06-23T18:39:50.677654",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "cond = len(hourly_data[hourly_data.index.duplicated()].sort_index())\n",
    "print('Date index contains no duplicate entries: {}'.format(cond == 0))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T18:39:50.986922Z",
     "iopub.status.busy": "2020-06-23T18:39:50.985611Z",
     "iopub.status.idle": "2020-06-23T18:39:51.012327Z",
     "shell.execute_reply": "2020-06-23T18:39:51.006631Z"
    },
    "papermill": {
     "duration": 0.119654,
     "end_time": "2020-06-23T18:39:51.012609",
     "exception": false,
     "start_time": "2020-06-23T18:39:50.892955",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Make sure time index is sorted and increasing\n",
    "print('Date index is strictly increasing: {}'.format(hourly_data.index.is_monotonic_increasing))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.108063,
     "end_time": "2020-06-23T18:39:51.209020",
     "exception": false,
     "start_time": "2020-06-23T18:39:51.100957",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "<a id=\"cell7\"></a>\n",
    "#### 2.5 Create a fixed interval dataset\n",
    "\n",
    "Most time-series analysis requires (or certainly works much better with) data that has fixed measurement intervals. As you may have noticed from the various data samples above, the measurement intervals for this dataset are not exactly hourly. So, we will use `Pandas`' resampling functionality to create a dataset that has exact hourly measurement intervals."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T18:39:51.439117Z",
     "iopub.status.busy": "2020-06-23T18:39:51.437935Z",
     "iopub.status.idle": "2020-06-23T18:39:51.540362Z",
     "shell.execute_reply": "2020-06-23T18:39:51.531784Z"
    },
    "papermill": {
     "duration": 0.215191,
     "end_time": "2020-06-23T18:39:51.541396",
     "exception": false,
     "start_time": "2020-06-23T18:39:51.326205",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Resample (downsample) to hourly rows (we're shifting everything up by 9 minutes!)\n",
    "hourly_data = hourly_data.resample('60min').last().shift(periods=1) #Note: use resample('60min', base=51) to resample on the 51st of every hour"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.140495,
     "end_time": "2020-06-23T18:39:51.763022",
     "exception": false,
     "start_time": "2020-06-23T18:39:51.622527",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "We will now also replace missing values. For numerical values, we will linearly interpolate between the previous and next valid obvservations. For the categorical `HOURLYPressureTendency` field, we will replace missing values with the last valid observation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T18:39:51.957804Z",
     "iopub.status.busy": "2020-06-23T18:39:51.956207Z",
     "iopub.status.idle": "2020-06-23T18:39:52.093727Z",
     "shell.execute_reply": "2020-06-23T18:39:52.091406Z"
    },
    "papermill": {
     "duration": 0.241565,
     "end_time": "2020-06-23T18:39:52.094112",
     "exception": false,
     "start_time": "2020-06-23T18:39:51.852547",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "hourly_data['HOURLYPressureTendency'] = hourly_data['HOURLYPressureTendency'].fillna(method='ffill') # fill with last valid observation\n",
    "hourly_data = hourly_data.interpolate(method='linear') # interpolate missing values\n",
    "hourly_data.drop(hourly_data.index[0], inplace=True) # drop first row"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T18:39:52.293143Z",
     "iopub.status.busy": "2020-06-23T18:39:52.291888Z",
     "iopub.status.idle": "2020-06-23T18:39:52.362844Z",
     "shell.execute_reply": "2020-06-23T18:39:52.361870Z"
    },
    "papermill": {
     "duration": 0.184815,
     "end_time": "2020-06-23T18:39:52.363119",
     "exception": false,
     "start_time": "2020-06-23T18:39:52.178304",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "print(hourly_data.info())\n",
    "print()\n",
    "hourly_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.067926,
     "end_time": "2020-06-23T18:39:52.505261",
     "exception": false,
     "start_time": "2020-06-23T18:39:52.437335",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "<a id=\"cell8\"></a>\n",
    "#### 2.6 Feature encoding\n",
    "\n",
    "The final pre-processing step we will perform will be to handle two of our columns in a special way in order to correctly encode these features. They are:\n",
    "\n",
    "1. `HOURLYWindDirection` - wind direction\n",
    "2. `HOURLYPressureTendency` - an indicator of pressure changes\n",
    "\n",
    "For `HOURLYWindDirection`, we encode the raw feature value as two new values, which measure the cyclical nature of wind direction - that is, we are encoding the compass-point nature of wind direction measurements."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T18:39:52.665408Z",
     "iopub.status.busy": "2020-06-23T18:39:52.664086Z",
     "iopub.status.idle": "2020-06-23T18:39:52.708472Z",
     "shell.execute_reply": "2020-06-23T18:39:52.707680Z"
    },
    "papermill": {
     "duration": 0.130511,
     "end_time": "2020-06-23T18:39:52.708622",
     "exception": false,
     "start_time": "2020-06-23T18:39:52.578111",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Transform HOURLYWindDirection into a cyclical variable using sin and cos transforms\n",
    "hourly_data['HOURLYWindDirectionSin'] = np.sin(hourly_data['HOURLYWindDirection']*(2.*np.pi/360))\n",
    "hourly_data['HOURLYWindDirectionCos'] = np.cos(hourly_data['HOURLYWindDirection']*(2.*np.pi/360))\n",
    "hourly_data.drop(['HOURLYWindDirection'], axis=1, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.079907,
     "end_time": "2020-06-23T18:39:52.855383",
     "exception": false,
     "start_time": "2020-06-23T18:39:52.775476",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "For `HOURLYPressureTendency`, the feature value is in fact a `categorical` feature with three levels:\n",
    "* `0-3` indicates an increase in pressure over the previous 3 hours\n",
    "* `4` indicates no change during the previous 3 hours\n",
    "* `5-8` indicates a decrease over the previous 3 hours\n",
    "\n",
    "Hence, we encode this feature into 3 dummy values representing these 3 potential states."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T18:39:53.019949Z",
     "iopub.status.busy": "2020-06-23T18:39:53.018825Z",
     "iopub.status.idle": "2020-06-23T18:39:53.195027Z",
     "shell.execute_reply": "2020-06-23T18:39:53.193941Z"
    },
    "papermill": {
     "duration": 0.264309,
     "end_time": "2020-06-23T18:39:53.195280",
     "exception": false,
     "start_time": "2020-06-23T18:39:52.930971",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Transform HOURLYPressureTendency into 3 dummy variables based on NOAA documentation\n",
    "hourly_data['HOURLYPressureTendencyIncr'] = [1.0 if x in [0,1,2,3] else 0.0 for x in hourly_data['HOURLYPressureTendency']] # 0 through 3 indicates an increase in pressure over previous 3 hours\n",
    "hourly_data['HOURLYPressureTendencyDecr'] = [1.0 if x in [5,6,7,8] else 0.0 for x in hourly_data['HOURLYPressureTendency']] # 5 through 8 indicates a decrease over previous 3 hours\n",
    "hourly_data['HOURLYPressureTendencyConst'] = [1.0 if x == 4 else 0.0 for x in hourly_data['HOURLYPressureTendency']] # 4 indicates no change during previous 3 hours\n",
    "hourly_data.drop(['HOURLYPressureTendency'], axis=1, inplace=True)\n",
    "hourly_data['HOURLYPressureTendencyIncr'] = hourly_data['HOURLYPressureTendencyIncr'].astype(('float32'))\n",
    "hourly_data['HOURLYPressureTendencyDecr'] = hourly_data['HOURLYPressureTendencyDecr'].astype(('float32'))\n",
    "hourly_data['HOURLYPressureTendencyConst'] = hourly_data['HOURLYPressureTendencyConst'].astype(('float32'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.110513,
     "end_time": "2020-06-23T18:39:53.410299",
     "exception": false,
     "start_time": "2020-06-23T18:39:53.299786",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "<a id=\"cell9\"></a>\n",
    "#### 2.7 Rename columns\n",
    "\n",
    "Before saving the dataset, we will rename the columns for readability."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T18:39:53.563710Z",
     "iopub.status.busy": "2020-06-23T18:39:53.562893Z",
     "iopub.status.idle": "2020-06-23T18:39:53.581651Z",
     "shell.execute_reply": "2020-06-23T18:39:53.578665Z"
    },
    "papermill": {
     "duration": 0.104134,
     "end_time": "2020-06-23T18:39:53.581955",
     "exception": false,
     "start_time": "2020-06-23T18:39:53.477821",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "hourly_data.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T18:39:53.759685Z",
     "iopub.status.busy": "2020-06-23T18:39:53.755570Z",
     "iopub.status.idle": "2020-06-23T18:39:53.766421Z",
     "shell.execute_reply": "2020-06-23T18:39:53.764474Z"
    },
    "papermill": {
     "duration": 0.094553,
     "end_time": "2020-06-23T18:39:53.766644",
     "exception": false,
     "start_time": "2020-06-23T18:39:53.672091",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# define the new column names\n",
    "columns_new_name = [\n",
    "    'visibility',\n",
    "    'dry_bulb_temp_f',\n",
    "    'wet_bulb_temp_f',\n",
    "    'dew_point_temp_f',\n",
    "    'relative_humidity',\n",
    "    'wind_speed',\n",
    "    'station_pressure',\n",
    "    'sea_level_pressure',\n",
    "    'precip',\n",
    "    'altimeter_setting',\n",
    "    'wind_direction_sin',\n",
    "    'wind_direction_cos',\n",
    "    'pressure_tendency_incr',\n",
    "    'pressure_tendency_decr',\n",
    "    'pressure_tendency_const'\n",
    "]\n",
    "\n",
    "columns_name_map = {c:columns_new_name[i] for i, c in enumerate(hourly_data.columns)}\n",
    "\n",
    "hourly_data_renamed = hourly_data.rename(columns=columns_name_map)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T18:39:53.957173Z",
     "iopub.status.busy": "2020-06-23T18:39:53.955625Z",
     "iopub.status.idle": "2020-06-23T18:39:54.051833Z",
     "shell.execute_reply": "2020-06-23T18:39:54.049413Z"
    },
    "papermill": {
     "duration": 0.181441,
     "end_time": "2020-06-23T18:39:54.052044",
     "exception": false,
     "start_time": "2020-06-23T18:39:53.870603",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "print(hourly_data_renamed.info())\n",
    "print()\n",
    "hourly_data_renamed.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T18:39:54.238929Z",
     "iopub.status.busy": "2020-06-23T18:39:54.237656Z",
     "iopub.status.idle": "2020-06-23T18:39:54.264152Z",
     "shell.execute_reply": "2020-06-23T18:39:54.265310Z"
    },
    "papermill": {
     "duration": 0.136374,
     "end_time": "2020-06-23T18:39:54.265600",
     "exception": false,
     "start_time": "2020-06-23T18:39:54.129226",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Explore some general information about the dataset\n",
    "print('# of megabytes held by dataframe: ' + str(round(sys.getsizeof(hourly_data_renamed) / 1000000,2)))\n",
    "print('# of features: ' + str(hourly_data_renamed.shape[1])) \n",
    "print('# of observations: ' + str(hourly_data_renamed.shape[0]))\n",
    "print('Start date: ' + str(hourly_data_renamed.index[0]))\n",
    "print('End date: ' + str(hourly_data_renamed.index[-1]))\n",
    "print('# of days: ' + str((hourly_data_renamed.index[-1] - hourly_data_renamed.index[0]).days))\n",
    "print('# of months: ' + str(round((hourly_data_renamed.index[-1] - hourly_data_renamed.index[0]).days/30,2)))\n",
    "print('# of years: ' + str(round((hourly_data_renamed.index[-1] - hourly_data_renamed.index[0]).days/365,2)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.092822,
     "end_time": "2020-06-23T18:39:54.484105",
     "exception": false,
     "start_time": "2020-06-23T18:39:54.391283",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "<a id=\"cell10\"></a>\n",
    "\n",
    "### 3. Save the Cleaned Data\n",
    "\n",
    "Finally, we save the cleaned dataset as a Project asset for later re-use. You should see an output like the one below if successful:\n",
    "\n",
    "```\n",
    "{'file_name': 'jfk_weather_cleaned.csv',\n",
    " 'message': 'File saved to project storage.',\n",
    " 'bucket_name': 'jfkweatherdata-donotdelete-pr-...',\n",
    " 'asset_id': '...'}\n",
    "```\n",
    "\n",
    "**Note**: In order for this step to work, your project token (see the first cell of this notebook) must have `Editor` role. By default this will overwrite any existing file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-06-23T18:39:54.663073Z",
     "iopub.status.busy": "2020-06-23T18:39:54.661693Z",
     "iopub.status.idle": "2020-06-23T18:39:58.037858Z",
     "shell.execute_reply": "2020-06-23T18:39:58.036838Z"
    },
    "papermill": {
     "duration": 3.476652,
     "end_time": "2020-06-23T18:39:58.038092",
     "exception": false,
     "start_time": "2020-06-23T18:39:54.561440",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "hourly_data_renamed.to_csv(\"data/noaa-weather-data-jfk-airport/jfk_weather_cleaned.csv\", float_format='%g')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.076713,
     "end_time": "2020-06-23T18:39:58.181900",
     "exception": false,
     "start_time": "2020-06-23T18:39:58.105187",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "#### Next steps\n",
    "\n",
    "- Close this notebook.\n",
    "- Open the `Part 2 - Data Analysis` notebook to explore the cleaned dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.098392,
     "end_time": "2020-06-23T18:39:58.364486",
     "exception": false,
     "start_time": "2020-06-23T18:39:58.266094",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "<a id=\"authors\"></a> \n",
    "### Authors"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.069249,
     "end_time": "2020-06-23T18:39:58.507345",
     "exception": false,
     "start_time": "2020-06-23T18:39:58.438096",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "This notebook was created by the [Center for Open-Source Data & AI Technologies](http://codait.org).\n",
    "\n",
    "Copyright © 2019 IBM. This notebook and its source code are released under the terms of the MIT License."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  },
  "papermill": {
   "duration": 38.709418,
   "end_time": "2020-06-23T18:39:58.863383",
   "environment_variables": {},
   "exception": null,
   "input_path": "Part 1 - Data Cleaning.ipynb",
   "output_path": "Part 1 - Data Cleaning-output.ipynb",
   "parameters": {},
   "start_time": "2020-06-23T18:39:20.153965",
   "version": "2.1.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
